[Kernels] Isolate modular kernel code from FusedMoEMethodBase subclasses.#27123
mgoin merged 16 commits into vllm-project:main from
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
💡 Codex Review
https://github.com/vllm-project/vllm/blob/474381baec872bc9f45e221754d420f41b93ace0/vllm/model_executor/layers/fused_moe/layer.py#L2115-L2118
Accessing missing fused_experts attribute
The commit removes fused_experts from FusedMoEMethodBase, but FusedMoE still unconditionally accesses self.quant_method.fused_experts here (and again later when staging tokens). When the quant method does not use a modular kernel—e.g. AWQ, BitsAndBytes, RTN—init_prepare_finalize now leaves the original quant method in place and it no longer defines a fused_experts attribute. These checks will therefore raise AttributeError before any routing happens. The guard should use hasattr or using_modular_kernel instead of dereferencing the attribute directly.
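A minimal sketch of the guard being suggested, assuming only that quant methods which don't use a modular kernel simply never define the attribute; the helper name below is illustrative, not the actual vLLM API:

```python
# Sketch (not the actual vLLM code): probe for fused_experts instead of
# dereferencing it unconditionally, since AWQ/BitsAndBytes/RTN-style quant
# methods keep their original quant method and never define the attribute.

class _PlainQuantMethod:            # stands in for e.g. an AWQ method
    pass

class _ModularQuantMethod:          # stands in for FusedMoEModularMethod
    fused_experts = object()        # placeholder for the modular kernel

def uses_modular_kernel(quant_method) -> bool:
    # Equivalent to the hasattr/using_modular_kernel check suggested above.
    return getattr(quant_method, "fused_experts", None) is not None

assert not uses_modular_kernel(_PlainQuantMethod())
assert uses_modular_kernel(_ModularQuantMethod())
```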
Code Review
This pull request refactors the handling of modular kernels for Fused MoE layers by introducing a FusedMoEModularMethod wrapper. This is a good simplification that centralizes logic. However, I've identified two critical issues that could lead to runtime errors: one relates to an incorrect condition for EPLB support in the FP8 quantization method, and the other to incorrect API usage for submodule replacement. I have provided detailed comments and suggestions to address these issues.
This pull request has merge conflicts that must be resolved before it can be merged.
Thanks @bnellnm. This cleans up a bunch of redundant code 🙌. I have a suggestion. IIUC, the function call chain for the construction of … Here, note that … My suggestion is to move … That way, we can get rid of most of the ModularKernel-specific code from … What do you think? I am not suggesting we do it in this PR. I can take it up as well 👍
Yeah, that's a good idea. I was also considering splitting up layer.py in different ways, e.g. move UnquantizedMoEMethod to a separate file. I'd rather do that in a separate PR though.
    if layer.w2_weight is None
    else layer.w2_weight
)
assert all([w is not None for w in [layer.w13_weight, layer.w2_weight]])
I think this setting of layer.w13_weight and layer.w2_weight fits better in the process_weights_after_loading function here.
That way we can get rid of having to differentiate between
w13_weight_triton_tensor/w2_weight_triton_tensor and w13_weight/w2_weight.
Not suggesting this for this PR. Fixing that, I think, should be its own PR.
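A hypothetical sketch of that suggestion, reusing the attribute names from the diff above; the real process_weights_after_loading does considerably more than this:

```python
# Sketch (not the actual vLLM code): resolve the *_triton_tensor fallbacks
# once after weight loading, so apply() only ever has to look at
# layer.w13_weight / layer.w2_weight.
def process_weights_after_loading(layer) -> None:
    if layer.w13_weight is None:
        layer.w13_weight = layer.w13_weight_triton_tensor
    if layer.w2_weight is None:
        layer.w2_weight = layer.w2_weight_triton_tensor
    assert all(w is not None for w in [layer.w13_weight, layer.w2_weight])
```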
varun-sundar-rabindranath left a comment
LGTM! Very nice cleanups! Thanks @bnellnm
Signed-off-by: Bill Nell <bnell@redhat.com>
@bnellnm I have a ... maybe dumb ... question - how exactly is each derived …
The … So, subclasses of …
Purpose

Make a new FusedMoEModularMethod subclass of FusedMoEMethodBase for use with modular kernels.

Instead of having every subclass of FusedMoEMethodBase check self.fused_experts, we swap out the quant_method of the FusedMoE layer to an instance of FusedMoEModularMethod. This will reduce the complexity of the various FusedMoEMethodBase subclass apply methods and isolate uses of modular kernels to the new class.
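A simplified sketch of the swap described above, with made-up signatures; the real classes and init_prepare_finalize in layer.py take many more arguments:

```python
# Sketch (not the actual vLLM classes): a wrapper quant method that owns the
# modular kernel and is installed in place of the original quant method.
class FusedMoEMethodBase:                      # stands in for the real base class
    def apply(self, layer, x):
        raise NotImplementedError

class FusedMoEModularMethod(FusedMoEMethodBase):
    def __init__(self, wrapped, fused_experts):
        self.wrapped = wrapped                 # original quant method (weights, scales, ...)
        self.fused_experts = fused_experts     # modular kernel (prepare/finalize + experts)

    def apply(self, layer, x):
        # All modular-kernel handling lives here instead of in every subclass.
        return self.fused_experts(x, layer)

def init_prepare_finalize(layer, fused_experts):
    # Swap the layer's quant_method for the modular wrapper; quant methods
    # that don't use a modular kernel are left untouched.
    if fused_experts is not None:
        layer.quant_method = FusedMoEModularMethod(layer.quant_method, fused_experts)
```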
Test Plan

Ran by hand on some fp8 + modelopt models.
CI tests
Test Result
cc @varun-sundar-rabindranath, @wenscarl